Loading Libraries and Reading Data

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(knitr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(sf)
## Warning: package 'sf' was built under R version 4.4.2
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(leaflet)
knitr::opts_chunk$set(
    echo = TRUE,
    warning = FALSE,
  fig.width = 6,
  fig.asp = .6,
  out.width = "90%"
)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
gun_df = read.csv("data/mass_shootings_2018_2024_cleaned.csv") |>
  janitor::clean_names()

Structure and Size of Data

The dataset contains information on gun-related incidents from 2018 to 2024 across different cities and states in the United States. There are a total of nrow(gun_df) observation and ncol(gun_df) variables.

  • date: Date of the incident (class: Date)
  • city: City where the incident occurred (class: character)
  • state: State where the incident occurred (class: character)
  • dead: Number of fatalities (class: integer)
  • injured: Number of injuries (class: integer)
  • total: Total number of victims (Dead + Injured) (class: character)
  • description: Textual description of incident (class: character)
glimpse(gun_df) 
## Rows: 3,932
## Columns: 7
## $ date        <chr> "2024-11-18", "2024-11-17", "2024-11-17", "2024-11-16", "2…
## $ city        <chr> "Denver", "New Orleans", "Columbus", "Walthall County", "R…
## $ state       <chr> "Colorado", "Louisiana", "Ohio", "Mississippi", "Californi…
## $ dead        <int> 0, 0, 0, 1, 0, 3, 1, 3, 1, 5, 0, 1, 0, 1, 2, 0, 3, 5, 0, 0…
## $ injured     <int> 4, 10, 4, 4, 4, 1, 4, 1, 3, 0, 4, 12, 5, 3, 2, 5, 3, 0, 4,…
## $ total       <int> 4, 10, 4, 5, 4, 4, 5, 4, 4, 5, 4, 13, 5, 4, 4, 5, 6, 5, 4,…
## $ description <chr> "Four men were wounded at a strip mall following a dispute…

Comparing Number of Incidents by Year, Quarter, Month, and Weekday

First, we will compare the number of incidents by year.

gun_df = gun_df |>
  mutate(date = as.Date(date), 
         year = year(date)) 

gun_plot = gun_df |>
  ggplot(aes(x = as.factor(year), fill = factor(year))) +
  geom_bar() +
  scale_y_continuous(
    labels = scales::comma,
    limits = c(0, NA),  
    expand = expansion(mult = c(0, 0.15))  
  ) +
  geom_label(stat = "count", aes(label = after_stat(count), y = after_stat(count)), vjust = -0.2) +
  labs(
    title = "Number of Gun-Related Incidents by Year",
    x = "Year",
    y = "Number of Incidents"
  ) +
  theme(legend.position = "none")

print(gun_plot)

The graph shows a general upward trend in gun-related incidents from 2018 to 2022. The peak would be at 717 incidents. The year 2020 shows a significant increase, rising from 441 incidents in 2019 to 617 incidents, and the trend continues with 703 incidents in 2021. However, from 2023 to 2024, the number of incidents appear to decline, with 596 incidents in 2023 and 538 incidents in 2024. This suggests that while the number of incidents has increased over the past few years, the decline in recent years might reflect a shift in trends or external factors impacting the data. Further analysis would be needed to understand the variations in detail.

We will now compare the number of incidents by quarter for each year.

gun_df = gun_df |>
  mutate(date = as.Date(date, format = "%Y-%m-%d"),
         year = year(date),
         quarter = quarter(date)
  )

gun1_plot = gun_df |>
  ggplot(aes(x = factor(quarter), fill = factor(year))) +
  geom_bar(position = "dodge") + 
  labs(
    title = "Quarterly Number of Gun-Related Incidents (2014-2024)",
    x = "Quarter",
    y = "Number of Incidents",
    fill = "Year"
  )
print(gun1_plot)

Let’s further break it down by years and quarters.

q1 = gun_df |>
  filter(year != 2013) |>
  select(year, quarter) |>
  group_by(year) |>
  count(quarter) |>
  ggplot(aes(x = as.factor(quarter), y = n, fill = factor(quarter))) + 
  geom_bar(stat = 'identity') + 
  scale_y_continuous(labels = scales::comma) + 
  facet_grid(. ~ year) + 
  labs(x = 'Quarter by Year', y = 'Number of incidents') +
  theme(legend.position = "none")

q2 = gun_df |>
  filter(year != 2013 & quarter == 1) |>
  select(year, quarter) |>
  group_by(year) |>
  count(quarter) |>
  ggplot(aes(x = as.factor(year), y = n, fill = factor(year))) + 
  geom_bar(stat = 'identity') + 
  scale_y_continuous(labels = scales::comma) + 
  labs(x = 'Incidents in Q1 of each year', y = 'Number of incidents') +
  theme(legend.position = "none")

gridExtra::grid.arrange(q1, q2)

The dataset on incidents of gun violence from 2018-2024 show some seasonality. We notice incidents in Q1 and Q4 are generally lower than those in Q2 and Q3. The second graph highlights that Q1 incidents increased steadily from 2018 to 2021, with smaller increases observed between 2021 and 2022, as well as between 2022 and 2023, suggesting a leveling-off trend during this period. In 2024, there is a slight decrease compared to 2023, but it is important to consider that 2024 is not yet complete. This potential plateau in Q1 incidents could signal a stabilization, though the overall trend since 2018 indicates a significant increase that still warrants attention.

We will now compare the number of incidents by months.

gun_df$month <- month(gun_df$date, label = TRUE)

gun2_plot = plotly::ggplotly(
  gun_df |>
    filter(year != c(2013, 2018)) |>
    count(month) |>
    ggplot(aes(x = month, y = n, fill = month)) + 
    geom_bar(stat = 'identity') + 
    scale_y_continuous(labels = scales::comma) + 
    labs(x = 'Month', y = 'Number of incidents', title = 'Cumulative Incidents by Month') + 
    theme(legend.position = "none")
) 

gun2_plot

The graph shows the number of incidents by month. Again, we see a clear seasonal trend. Incidents steadily rise from January through July, peaking in July (likely due to heightened activity during the summer months). A sharp decline follows from August to December, with November and December showing the lowest numbers. This could be influenced by colder weather, holiday periods, or shorter months (i.e., February). The visible trends suggest a relationship between weather (seasonal variations), social behavior, and incident rates. You can hover over the bars for exact values to compare these trends in more detail.

Let’s see the dates with the most incidents.

gun_df$day = day(gun_df$date)  
gun_df = gun_df |> mutate(date2 = paste(month, day)) 

knitr::kable(
  gun_df |> 
    filter(year != c(2013, 2018)) |> 
    count(date2) |> 
    top_n(10) |> 
    arrange(desc(n)) |> 
    rename(date = date2, "total number of incidents" = n)
)
## Selecting by n
date total number of incidents
Jul 5 49
Jul 4 48
Jan 1 31
Jun 11 26
Jul 28 23
Jun 17 23
Jun 20 22
Jun 23 22
Jul 23 21
May 18 21

The table highlights the dates with the highest total number of incidents, aggregated over 2018-2024. July 5th and July 4th stand out with 49 and 48 incidents respectively, indicating that Independence Day celebrations significantly contribute to increased incidents, likely due to festivities and the use of fireworks or firearms. January 1st, with 31 incidents, suggests that New Year’s Day also brings heightened risk, potentially linked to post-midnight celebrations and gatherings. The remaining dates predominantly occur in late June and July, with totals ranging from 21 to 26 incidents, reflecting a seasonal peak in incidents during summer months. May 18th, while slightly earlier, still aligns closely with this summer trend, reinforcing the pattern of increased incidents during warmer months. These results show the influence of holidays and seasonal trends on incident rates.

We will now compare the number of incidents by weekday.

gun_df$weekday = wday(gun_df$date, label = TRUE)

gun_df |> 
  count(weekday) |> 
  ggplot(aes(x = weekday, y = n, fill = weekday)) + 
  geom_bar(stat = 'identity') + 
  scale_y_continuous(labels = scales::comma) + 
  labs(x = 'Weekday', y = 'Number of incidents', title = 'Cumulative Incidents by Weekday') +
  theme(legend.position = "none")

Sunday stands out as the most dangerous day, with incidents far exceeding 900. Saturdays follow closely with around 900 incidents. Fridays show a moderate increase with about 450 incidents, marking the start of the weekend.Meanwhile, weekdays see significantly lower numbers, with Tuesday recording the fewest incidents at around 200, followed by Thursday at about 350, Wednesday at 375, and Monday at 440. This trend show the influence of weekends or leisure culture on the frequency of incidents, with weekdays being relatively safer perhaps due to structured routines and fewer late-night activities.

Incidents by State

plotly::ggplotly(gun_df %>% 
                   count(state) %>%
        ggplot(aes(x=reorder(state, n), y=n, fill=n, text=state, fill = state)) +
        geom_bar(stat='identity') +
        theme(axis.text.x = element_text(angle = 80, hjust = 1)) +
        labs(x='', y='Number of incidents', title = 'Cumulative Incidents by State'),
        tooltip=c("text", "y"))

The interactive bar graph above shows absolute number of gun violence by states. Hovering over the bars shows a label with the absolute number of incidents for a state. Illinois has the highest number of gun violence cases followed by Texas and California.

Incidents Relative to State Population

statesPop <- read_csv("data/PopulationUS.csv")
## Rows: 53 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): SUMLEV, REGION, DIVISION, STATE, NAME
## dbl (3): POPESTIMATE2017, POPEST18PLUS2017, PCNT_POPEST18PLUS
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
statesPop <- statesPop %>% 
  select(NAME, POPESTIMATE2017)

statesPop <- statesPop %>% 
  filter(!NAME %in% c("United States", 
                      "Puerto Rico Commonwealth"))

statesPop <- statesPop %>% 
  rename(state=NAME)

statesPop$state <- as.factor(statesPop$state)
incidentsByState <- gun_df %>% 
  group_by(state) %>%
  count(state)

incidentsByState <- left_join(incidentsByState, 
                             statesPop, 
                             by="state")

incidentsByState <- incidentsByState %>%  
  filter(state != "Washington, D.C.") 

incidentsByState$Per100000 <- ((incidentsByState$n/incidentsByState$POPESTIMATE2017)*100000)

kable(head(incidentsByState))
state n POPESTIMATE2017 Per100000
Alabama 115 4874747 2.3590968
Alaska 6 739795 0.8110355
Arizona 46 7016270 0.6556190
Arkansas 49 3004279 1.6310070
California 297 39536653 0.7512017
Colorado 71 5607154 1.2662395
incidentsByState
## # A tibble: 52 × 4
## # Groups:   state [52]
##    state                    n POPESTIMATE2017 Per100000
##    <chr>                <int>           <dbl>     <dbl>
##  1 Alabama                115         4874747     2.36 
##  2 Alaska                   6          739795     0.811
##  3 Arizona                 46         7016270     0.656
##  4 Arkansas                49         3004279     1.63 
##  5 California             297        39536653     0.751
##  6 Colorado                71         5607154     1.27 
##  7 Connecticut             24         3588184     0.669
##  8 Delaware                17          961939     1.77 
##  9 District of Columbia    57          693972     8.21 
## 10 Florida                211        20984400     1.01 
## # ℹ 42 more rows

This table provides the numerical values for the number of incidents, 2017 population, and the scaled number of incidents per 100,000 people.

plot <- ggplot(incidentsByState, aes(x = reorder(state, Per100000), y = Per100000, fill = Per100000)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = '', y = 'Incidents per 100,000 inhabitants', title = 'Cumulative Incidents by State Relative to Population') +
  theme_minimal() +
  theme(legend.position = "none")

ggplotly(plot, tooltip = c("y", "x"))

This is the Number of Incidents scaled to state population. The order changes quite drastically where the District of Columbia, Louisiana, and Mississippi now lead and Idaho, Wyoming, and New Hampshire now have the lowest though the bottom is much more concentrated.

Interactive Map

states <- st_read(dsn = "data/cb_2017_us_state_500k/cb_2017_us_state_500k.shp", quiet = FALSE, stringsAsFactors = TRUE)
## Reading layer `cb_2017_us_state_500k' from data source 
##   `C:\Users\nadka\OneDrive\Desktop\Data Science\p8105_final\data\cb_2017_us_state_500k\cb_2017_us_state_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## Geodetic CRS:  NAD83
class(states)
## [1] "sf"         "data.frame"
kable(head(st_drop_geometry(states)))
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER
54 01779805 0400000US54 54 WV West Virginia 00 62265662566 489840834
17 01779784 0400000US17 17 IL Illinois 00 143784114293 6211277447
24 01714934 0400000US24 24 MD Maryland 00 25150696145 6980371026
16 01779783 0400000US16 16 ID Idaho 00 214048160737 2393355752
50 01779802 0400000US50 50 VT Vermont 00 23873457570 1031134839
09 01779780 0400000US09 09 CT Connecticut 00 12542619303 1815495323
addPer100k <- data.frame(id=states$GEOID, 
                         name=states$NAME)

addPer100k <- left_join(addPer100k, incidentsByState %>% 
                          select(state, Per100000) %>% 
                          rename(name=state), by="name")

addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0

states$per100k <- addPer100k$Per100000
bins <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)

pal <- colorBin("YlOrRd", 
                domain = states$per100k, 
                bins = bins)

state_popup <- paste0("<strong>State: </strong>", 
                      states$NAME, 
                      "<br><strong>Incidents per 100,000 inhabitants </strong>", 
                      states$per100k) %>% 
  
  lapply(htmltools::HTML)

leaflet(data = states) %>%
  
        setView(lng=-96, 
                lat=37.8, 
                zoom=4) %>%
  
        addProviderTiles("MapBox", options = providerTileOptions(id = "mapbox.light",
                accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) %>%
  
        addPolygons(
                fillColor = ~pal(per100k),
                weight = 2,
                opacity = 1,
                color = "white",
                dashArray = "3",
                fillOpacity = 0.7,
                highlight = highlightOptions(
                        weight = 5,
                        color = "#666",
                        dashArray = "",
                        fillOpacity = 0.7,
                        bringToFront = TRUE),
                label = state_popup,
                labelOptions = labelOptions(
                        style = list("font-weight" = "normal", 
                                     padding = "3px 8px"),
                        textsize = "15px",
                        direction = "auto")) %>%
  
        addLegend(pal = pal, 
                  values = ~per100k, 
                  opacity = 0.7, 
                  title = "Incidents", 
                  position = "bottomright")

Here an interactive map which displays the same data. It is scaled via a discrete legend so the heat map is slightly different than viridis but it conveys the same data in a more engaging format.

Victims by State

VictimsByState <- gun_df %>% 
  group_by(state) %>% 
  summarize(total_victims=sum(total), 
            total_injured=sum(injured), 
            total_dead=sum(dead), 
            percent_dead=round(total_dead/total_victims,2), 
            total_incidence=n(), 
            percent_victims_per_incidents=round(total_victims/total_incidence,2))

head(VictimsByState)
## # A tibble: 6 × 7
##   state      total_victims total_injured total_dead percent_dead total_incidence
##   <chr>              <int>         <int>      <int>        <dbl>           <int>
## 1 Alabama              647           516        131         0.2              115
## 2 Alaska                26            14         12         0.46               6
## 3 Arizona              259           193         67         0.26              46
## 4 Arkansas             265           208         57         0.22              49
## 5 California          1564          1174        390         0.25             297
## 6 Colorado             382           278        104         0.27              71
## # ℹ 1 more variable: percent_victims_per_incidents <dbl>

This table provides the number of victims by state and the number of victims per incident by state. This shows how many people were involved in the average incident by state.

VictimsByState %>% ggplot(aes(x = reorder(state, 
                         -percent_victims_per_incidents), 
             y = percent_victims_per_incidents, 
             fill = percent_victims_per_incidents)) + 
  
  geom_bar(stat = 'identity') +
  
  labs(x = '', 
       y = 'Victims per incidents',
       fill = 'Percent of Total Victims per Incident', 
       title = 'Cumulative Percent of Total Victims per Incident by State') +
  
  theme(axis.text.x = element_text(angle = 80, 
                                   hjust = 1))

The bar graph above shows the number of victims per incident by state. Maine has the highest victims by incidents followed by Washington, DC followed by Puerto Rico.

VictimsByState <- left_join(VictimsByState, 
                           statesPop, 
                           by="state")

VictimsByState$Per100000 <- round((VictimsByState$total_victims/VictimsByState$POPESTIMATE2017)*100000)


plotly::ggplotly(VictimsByState %>% 
                   
                   filter(state!="District of Columbia") %>%
                   
        ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
          
        geom_bar(stat='identity') + 
        
        coord_flip() +
          
        labs(x = '', 
             y = 'Victims per 100,000 inhabitants',
             fill = 'Total Victims per Incident', 
             title = 'Cumulative Total Victims per Incident Relative to State Population') + 
          
        theme(legend.position="none"),
        tooltip=c("text", "y"))

This shows the average number of victims per incident when we scale by state. There is another dramatic reordering with Louisiana, Mississippi, and Illinois now taking the top spot. There is no data for DC.

Interactive Map

addPer100k <- addPer100k %>% select(id, name)

addPer100k <- left_join(addPer100k, VictimsByState %>% 
                          select(state, Per100000) %>% 
                          rename(name=state), 
                        by="name")

addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0

states$per100k <- addPer100k$Per100000
bins <- c(0, 5, 10, 15, 20, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)

state_popup <- paste0("<strong>State: </strong>", 
                      states$NAME, 
                      "<br><strong>Victims per 100,000 inhabitants </strong>", 
                      states$per100k) %>% lapply(htmltools::HTML)

leaflet(data = states) %>%
  
        setView(lng=-96, 
                lat=37.8, 
                zoom=4) %>%
  
        addProviderTiles("MapBox", 
                         options = providerTileOptions(id = "mapbox.light",
                accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN'))) %>%
  
        addPolygons(
                fillColor = ~pal(per100k),
                weight = 2,
                opacity = 1,
                color = "white",
                dashArray = "3",
                fillOpacity = 0.7,
                highlight = highlightOptions(
                        weight = 5,
                        color = "#666",
                        dashArray = "",
                        fillOpacity = 0.7,
                        bringToFront = TRUE),
                label = state_popup,
                labelOptions = labelOptions(
                        style = list("font-weight" = "normal", padding = "3px 8px"),
                        textsize = "15px",
                        direction = "auto")) %>%
  
        addLegend(pal = pal, 
                  values = ~per100k, 
                  opacity = 0.7, 
                  title = "Victims Per Incident", 
                  position = "bottomright")

Once again this is an interative map which helps visalize and reinforce the previous information.

Incidents by City

incidentsByCity <- gun_df %>% 
  select(city, state) %>% 
  group_by(city, state) %>% 
  summarize(cityIncidents=n())
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
incidentsByCity
## # A tibble: 1,254 × 3
## # Groups:   city [1,158]
##    city      state          cityIncidents
##    <chr>     <chr>                  <int>
##  1 Abbeville South Carolina             1
##  2 Aberdeen  Maryland                   1
##  3 Abington  Massachusetts              1
##  4 Adelanto  California                 1
##  5 Adelphi   Maryland                   1
##  6 Aguanga   California                 1
##  7 Aiken     South Carolina             2
##  8 Akron     Ohio                       6
##  9 Alachua   Florida                    1
## 10 Alameda   California                 1
## # ℹ 1,244 more rows

Here are the top number of incidents by city.

incidents_nystate <- incidentsByCity[incidentsByCity$state == 'New York', ] 

incidents_nystate
## # A tibble: 24 × 3
## # Groups:   city [24]
##    city         state    cityIncidents
##    <chr>        <chr>            <int>
##  1 Albany       New York            10
##  2 Bay Shore    New York             2
##  3 Buffalo      New York            17
##  4 Copiague     New York             1
##  5 Freeport     New York             1
##  6 Hempstead    New York             2
##  7 Lockport     New York             1
##  8 Mount Vernon New York             1
##  9 New City     New York             1
## 10 New Rochelle New York             1
## # ℹ 14 more rows

Here are the top number of incidents per city within New York State.

incidents_nyc <- incidentsByCity[(incidentsByCity$city %in% c('New York City')) & 
                                   incidentsByCity$state=='New York',]

incidents_nyc
## # A tibble: 1 × 3
## # Groups:   city [1]
##   city          state    cityIncidents
##   <chr>         <chr>            <int>
## 1 New York City New York            91

And here are the number of incidents per New York City. For a city of 8 million people this seems quite low. This suggest that crime has fallen dramatically, crime is under-reported, or that the data collection mechanism is selective.

threshold <- incidentsByCity %>%
  pull(cityIncidents) %>%
  sort(decreasing = TRUE) %>%
  nth(50) 

top_cities <- incidentsByCity %>%
  filter(cityIncidents >= threshold) %>% 
  
  arrange(desc(cityIncidents))  

top_cities %>%
  ggplot(aes(x = reorder(city, cityIncidents), 
             y = cityIncidents, 
             fill = cityIncidents)) + 
    geom_bar(stat = 'identity') +
    labs(x = '', 
         y = 'Number of Incidents', 
         title = 'Top 50 Cities by Number of Incidents',
         fill = 'Number of Incidents') +
    coord_flip()

The bar graph above shows the number of incidents by city. Chicago has the highest number of incidents followed by Philadelphia followed by New York City.